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Abstract: A maximum likelihood based model selection of discrete Bayesian networks is 
$-H , considered. The structure learning is performed by employing a scoring function S, which, for 

a given network G and n-sample D n , is defined as the maximum marginal log-likelihood I 
1 minus a penalization term X n h proportional to network complexity h(G), 

^ '■ S(G\D n ) = l(G\D n ) - X n h(G). 

An available case analysis is developed with the standard log-likelihood replaced by the sum 
of sample average node log-likelihoods. The approach utilizes partially missing data records 
' 1 ' and allows for comparison of models fitted to different samples. 

t™^ ' In missing completely at random settings the estimation is shown to be consistent if and 

only if the sequence A n converges to zero at a slower than n~ 1 / 2 rate. In particular, the BIC 

* model selection (A n = 0.5 log(n)/ra) applied to the node-average log-likelihood is shown to be 
i-p^ ■ inconsistent in general. This is in contrast to the complete data case when BIC is known to be 

' consistent. The conclusions are confirmed by numerical experiments. 
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1. Introduction 

The continuing interest in developing sparse statistical models, with the notable presence of Bayesian 
networks among them, is well motivated by a number of pressing practical problems coming from 
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gene/protein expression analysis and medical imaging, to mention a few. Although graphical proba- 
bility models based on directed connections between random variables provide efficient joint distri- 
bution description, the application of such models is often limited by the ambiguity of their observed 
behavior which makes the learning rather difficult. 

One of the prevailing approaches to graphical model selection is through optimization of some 
scoring functions. In the context of Bayesian networks, the usual choice is the log of posterior. 
Let (G, 8) be a Bayesian network with graph structure G and probability model parameter 9 £ 
0. Following the Bayesian paradigm (see for example [5] and [11]), one specifies prior probability 
distributions n for G and 9. Then, for a sample D n of size n, one considers the Bayesian scoring 
function 

S{G\D n )=logn{G)+\ogC{G\D n ), 

where 

C(G\D n ) = f £(G,9\D n )Tr(8)d8 

is the so-called marginal likelihood of G, while C(G, 9\D n ) is the usual likelihood of (G,8). The 
Bayesian scoring function measures the posterior certainty under the chosen prior system and the 
model with maximum score is thus a natural estimator. 

The main virtue of the Bayesian approach is in counter-balancing the tendency of the maximum 
likelihood estimation to choose the most complex model fitting the data. As first noticed by [10], 
when the probability parameter space constitutes an exponential family in an Euclidean space, 
the marginal log-likelihood of a model M admits the approximation 

log£(M|A0 = BIC(M|D n ) + O p (l), 

based on the so-called Bayesian Information Criterion (BIC), 

BIC(M|D n ) = log£(M,0 M |£>„) - 0.51og(n) dim(Af), 

where 9m is the value of 9 that maximizes the log-likelihood for given M and D n , and dim(A/) is 
the dimension of M . The immediate application of this result to discrete and conditional Gaussian 
Bayesian networks was postponed because of the non-Euclidean structure of the parameter space 
for these models. This obstacle was later overcome by [7], who showed the validity of the BIC 
approximation for a much large family of curved exponential distributions. 

In a later work, [6] applied this result to several families of Bayesian network models including 
the discrete ones, thus showing the asymptotic consistency of BIC. In its generality, the parameter 
space ©g of a discrete Bayesian network G comprises a collection of multinomial distributions and 
the total number of parameters needed to specify them is what is understood as dimension of 0q. 
The BIC approximation is then expressed as 

log £(G\D n )= log £(G,9 G \D n )-0.5\og{n) dim(0 G ) + O p (l). (1.1) 

Equation (1.1) suggests a more direct estimating procedure - selecting a model G in Q with 
maximal BIC score. There are two typical arguments in favor of this route versus the Bayesian one. 
The first one is methodological - prior based inference is not universally accepted. The other one is 
computational - calculating marginal likelihoods can be prohibitive, especially so in the framework 
of large dimensional graphical models. 

These observations have motivated us to pursue the latter, non-Bayesian approach - maximum 
likelihood estimation followed by model selection according to some scoring criteria. To generalize 
it, we reformulate the right-hand side of (1.1) and consider the following estimation problem 

G = arg maxin' 1 log C(G,9 G \D n ) - X n h(G)}, (1.2) 
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where A„ is some positive sequence and h is a function measuring the complexity of G. The class 
of problems (1.2) is known as extended (or penalized) likelihood approach [3]. Typical penalization 
parameters are A„ = 0.5n _1 log(n) (BIC) and A n = (AIC), while dim(<dc) is a usual choice for 
h. We briefly remark that, in order to be useful in practice, the estimation problem (1.2) relies on 
two assumptions: (1) for a fixed G, the MLE Qq can be easily found, and (2), the set of networks Q is 
not prohibitively large, which usually requires imposing some network structure restrictions. In this 
paper however, we are mainly concerned with the theoretical aspects of (1.2) - to our knowledge, the 
consistency properties of G are not investigated in presence of missing values - and present results 
which are relevant to all estimation algorithms involving penalized log-likelihood of this form. 

The paper contributes in three main directions. First, in order to more efficiently handle data 
with incomplete records, we modify the scoring based model selection (1.2) by replacing the log- 
likelihood function with what we tentatively call node-average log-likelihood (NAL) - a sum of 
sample average node log-likelihoods relative to the node parents. The NAL statistics utilizes partially 
incomplete sample records instead of discarding them and provides means for comparing models 
fitted to different samples. We argue that when the number of nodes is large in comparison to the 
parent sizes, the NAL-based estimation achieves efficiency close to that of the computationally more 
demanding Expectation Maximization (EM) procedure [8]. Second, we focus on missing completely 
at random data models for they essentially guarantee network identifiability. More general missing 
at random mechanisms, in most cases, obscure the underlying network structure and render the 
network unidentifiable. Third, we generalize the scoring criteria by allowing the complexity measure 
h to be any positive function (as long as it is increasing for G as defined later) and a continuum 
of penalization parameters A„ = 0{n~ a ) by specifying a range of possible values a for which the 
estimation is consistent. 

In Section 2 we introduce the notion of node-average log-likelihood and describe the model se- 
lection problem in the context of Bayesian networks. Then, Section 3, we consider the question of 
network identifiability and formulate consistency in terms of scoring criteria. For the latter we follow 
[7] and [4]. We show in Section 3.1 that if the data is missing completely at random, the identifiability 
arises under some natural conditions. Section 4 presents the main result in this paper, Theorem 4.1, 
claiming that the estimation is asymptotically consistent provided that A„ goes to zero at slower 
rate than -nT 1 , in the complete data case, and n -1 / 2 , in presence of missing data. We also show the 
necessity of the later in missing completely at random settings. Thus, the inconsistency of AIC is 
(re)confirmcd along with somewhat unexpected conclusion regarding the BIC criteria - in the context 
of NAL optimization, BIC is consistent when applied to complete data but inconsistent otherwise. 
In Section 5 we present some numerical results in confirmation of the theory which are carried out 
with the catnet package for R. We conclude with a short discussion on possible extensions of the 
presented approach beyond the class of discrete Bayesian networks. 

2. Problem formulation and motivation 
2.1. Basic definitions 

Let X = (Xi)f =l be a N- vector of discrete random variables. Any directed acyclic graph (DAG) 
G with nodes X is a collection of directed edges from parent to child nodes such that there are no 
cycles. We denote with Pai the parents of node Xi in G; then G is completely described by the 
parent sets {Pai}f =1 . The set of all DAGs with nodes X admits partial ordering. We say that G\ is 
included in G2 and write Gi C G2 if all directed edges of G\ are present in G2 as well. An element 
G of a set of DAGs Q is called minimal if there is no G <G Q such that G C G; similarly defined are 
maximal DAGs. In a set of nested DAGs, the minimum and maximum DAG are always uniquely 
defined. 

Discrete Bayesian network (DBN) on X is any pair (G, P) consisting of DAG G and probability 
distribution P on X subject to two conditions: 
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(1) the joint distribution of X given by P satisfies the so-called local Markov property (LMP) 
with respect to G - any node- variable is independent of its non-descendants given its parents, 

(2) G is a minimal DAG compatible with P, that is, there is no G C G such that P satisfies LMP 
with respect to G. 

For any DAG G, there is an order of its nodes, called causality order, such that the parents of 
each one appear earlier in that order. We say that G is compatible with an order f2 if Pan(i) = 
and for all i = 2, ...,N, Pau(i) C {-X"n(i), — ,-Xn(i-i)}- For i < j, we denote with A 0(i ) -< X n ^ the 
fact that -X"f2(i) appears before -X'n(j) m the order f2. 

In its generality, the discreteness of our model implies that for each state xp ai of the parents 
of Xi, the probability distribution of Xi conditional on Pai = xp ai is multinomial. Moreover, the 
conditional probability tables {P(Xi\Pai)}^ =1 fully specify the joint distribution of X. Indeed, let G 
be compatible with an order f2, that is, -< Xq^) ■■■ Xn(N)- Then, taking into account the 

LMP, it is evident that with respect to G, the joint probability distribution permits the factorization 

N N N 

P(X) =Y[P(X m \X n(1) , ...,X n(j _ 1) ) = l[P(X n(i) \Pa m ) = J[P{X i \Pa i ). 

i—1 i—1 i—1 

Depending on the context, in a pair (G, P), we shall refer to P either as a joint distribution, such 
as in the left-hand side of the above display, or as a set of conditional probability tables, as in the 
right-hand side above. 

For any DAG G, there is a maximal set 1(G) of (structural) conditional independence relations 
of the form (A X B\C), for A,B,C C X and A,P ^ 0, determined by LMP [9]. On the other 
hand, P also defines a set of (distributional) independence constraints on X. Condition (2) in the 
definition of DBN is needed for assuring that the sets of structural and distributional independence 
statements in fact coincide. We say that two DAGs Gi and G2 are equivalent and write Gi = G2 if 
I(Gi) = I(G2)- With [G] we shall denote the class of DAGs equivalent to G. Necessary and sufficient 
conditions for DAG equivalence can be found in [13, 4]. We call two DBNs (Gi,Pi) and (G2,p2) 
equivalent if their joint distributions are equal, Pi = P2, which implies equivalence between their 
graph structures, Gi = G2. The essential problem of BN learning is the recovery of the equivalence 
class [G] from data. 

Another useful notion is that of network complexity. The complexity of a DBN G is typically 
measured by the number of parameters df(G) needed to specify the conditional probability table of 
G. Let q(Xi) be the number of states, or discrete levels, of Xi and q(Pcii) = IlxePa <?P0 ^ e the 
number of states of the parent set Pa^. Since for every state of P<Zj, q(Xi) — l parameters are needed to 
define the corresponding multinomial distribution for Xi, we have df(G) = 53i=i Q(P a i){<l{Xi) — 1). 

Next we formulate the maximum likelihood estimation (MLE) in the context of DBNs. Let D n = 
{x s }™ =1 be a sample of n independent observations on the vector X. Then, the log-likelihood of a 
DBN (G, P) with respect to D n is 

N n 

log C(G,P\D n ) = ]T]T log P(A ? = xi\P ai = x P J, (2.1) 

i=l s=l 

where xf and x s Pa . are the states of Xi and its parent set Pdi in the s-th record X s . According to the 
ML principle, a DBN estimator can be obtained by maximizing (2.1). Before proceeding with the 
inference in presence of missing values we need to introduce some useful statistics and convenient 
notations. 

We write k £ Xi to index the states of Xi and adopt a multi-index notation, j £ Pai, for the 
parent configurations of Xi. Let li,kj be the indicator function of the event (Xi = k,Pai = j). 
For a given sample D n let us define the counts n^jy = 2™=i ^i,kj{x s ), riij = Y^kex ■ n i,kj an d 
n i = SjePa n i,j- A record X s in D n we shall call incomplete if some of the values xf are missing. 
By convention, if the value of Xi in x s is missing, then \i^j{x s ) = 0, while if some of the parents in 
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Pen are missing, then both li t kj(x s ) = and lij(x 3 ) = 0. It is always the case then that rij < n. 
We shall consider an inference framework using the counts riij and n^kj as statistics summarizing 
the information in the sample D n . 

Let Z = (Zi)fL 1 be a binary random vector such that Z» = 1 if A^ is observed and Zi = if it is 
missing. For an index set A we define Za = YiieA ^i- The joint distribution of (X, Z) describes all 
incomplete samples D n of observations on X. 

Let us introduce the probabilities 9i, 9i t j and 0i t kj as 

9 i = P(Z i = \,Z Pai = \) 

9 i , j =P(Pa i =j\Z i = l,Z Pai = 1), 

e iikj = P(Xi = k\Pa t = j, Z t = 1, Z Pa% = 1). (2.2) 

With 9 we shall denote the set {9i,9i t j,9i i kj}i,k,j and call it observed conditional probability table 
of G. 

For a sample of fixed size n, the random variables n, and the random vectors {^i,j}jeFai and 
{«i,fej}fteXi then satisfy 

rii\n r~j Binom(0i,n) 

{ n i,j}j\ n i ~ Multinom({9ij}j 1 ni) 

{".j,, \h a,., ~ Multinom({9 it kj}k,n'i,j). (2.3) 

Therefore, as long as nj, riij and n^jy are of interest, the table is all we need to know about the 
DBN and the mechanism of missingness. 

The usual point estimators of 9i, Q^kj and are 

f) — — ft — ' f }iA n — ni - k i 

"i 5 ^i,i — ' "i,kj — 

n mj 

We shall denote the conditional tabic defined by ^'s with 9(G\D n ) to emphasize that it is estimated 
for the DAG G from the sample D n . The statistics 9ij and 9i t kj are unbiased estimators of 9ij and 
9i,kji respectively 

E9ij = E nz E{ — \ni) = 0i d , E9 iM = E n ^E(^\n^) = 9 iM . (2.4) 

The missing data distribution usually belongs to one of the following categories: 

(i) The data is missing completely at random (MCAR) when the missing probabilities arc unre- 
lated to either the observed or the unobserved values. In this case Z is independent of X and we 
have 9i t j = P(Pcti = j) and 9i^j = P{Xi = fc|Pa, = j). 

(ii) The data is missing at random (MAR) when the missing probabilities depend on the observed 
values but not on the unobserved ones. Let us consider a special case of MAR when for each i, there 
is Ci C X such that Xi ^ Ci, Cj fl Pai = and (Zi, Zp ai ) is independent of (Xi,Pa,i) given C{. If 
furthermore Ci has no descendants of Xi, then, by application of LMP, 9i ; kj = P{Xi = k\Pai = j) 
holds. For a general MAR however the latter may not be true. 

(iii) If the missing probabilities depend on the unobserved values we have not missing at random 
(NMAR) case and then neither fty = P(Pa, = j) nor 9i t kj = P(Xi = k\Pai = j) hold anymore. 

As we discuss in Section 3.1, the missing data distribution is implicated in network identifiability. 
In this regard, the MCAR model is the most transparent one for it does not interfere with the 
network topology. 
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2.2. Node-average log-likelihood 

We consider two objective functions for estimating DBNs based on the log-likelihood (2.1). The first 
one is the sample average log-likelihood 

1 N 

l(G\D n ) = ~2J £ £ n i,kj io E^i,kj 

i=i jePai keXi 

N 

= £ £ ^E^w^w ( 2 - 5 ) 

When the data has no missing values we have nl(G\D n ) = maxg log£(G, 8\D n ). 
The second objective function is the sum of sample average node log-likelihoods 

N 1 

k^iao E- E E lo §^ 

iV N 

= E E ^ E ^ lo %kki =J2l{X i \Pa i ,D n ), (2.6) 

i=i jePa; keXi i=i 

where l(Xi\Pai, D n ) = X)jep a ®iJ^2keX ■ ^-°S ^t,fcj i s known as negative conditional entropy 
of node Xj. Hereafter, we drop the qualifier 'sample average' from (2.5) and (2.6) and call (2.6) 
node-average log-likelihood (NAL). 

If D n is a complete sample, then for every i, rii = X^ep a n «-i = n - Hence OijO^kj = ni,kj/n 
and consequently l(G\D n ) = l(G\D n ). If the data is incomplete however, we may have rij < n 
and then (2.5) and (2.6) will be different. In the latter case, the log-likelihood (2.5) may have 
unbalanced representation of the potential parent sets. For example, if for two different parent sets 
Pai and Pa'i of the i-th node rii(Pa^) < rii(Pai), then Pa\ might be preferably selected due to the 
smaller size of the subsample that represents it in (2.5) even when Pa\ has worse fit than Pai, i.e. 
^XilPa'ijDn) < l(Xi\Pai,D n ). The simplest solution to this problem - discarding all incomplete 
records in the sample - may drastically reduce the effective sample size. On the other hand, (2.6) 
can utilize all rii sample records for estimation of Oi^j's. Essentially, NAL exploits the decomposable 
nature of the log-likelihood (2.5) and, by adjusting for the sample size, allows comparison of models 
fitted to different samples. We mention that, similarly, NAL can be adopted in other decomposable 
log-likelihood based models. 

It can be easily demonstrated that the maximum likelihood principle alone is inefficient for esti- 
mating DBNs. Let us assume for simplicity that Q comprises all DBNs with node order compatible 
with the index order, -< Xi -< ... -< Xn- The maximum NAL equation, G = arg max l(G\D n ), 
will then result in the following estimates for the parents set Pai 

Pai = arg max §i V" 0; uj k>g$i ,kj- 

p ai c{i i-n ^— ' 

kex. 

From the increasing property of the conditional log- likelihood (see Lemma 7.1 below) it follows that 
the solution of the above equation is Pai = {1, i — 1}, for every i > 1. Thus, the MLE solution will 
be the most complex DBN in Q and will overestimate the true G. In the remainder of this paper we 
shall investigate more closely the properties of NAL-based estimation in a model selection context 
and shall provide criteria for asymptotically consistent estimation. 
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2.3. Relation between NAL maximization and EM algorithm 

In missing data settings, the standard way to utilize all of the available data is to apply an EM 
algorithm - see [8] for application of EM to Bayesian networks. For a sample D n let D° 6s be the 
observed part of the data. The EM algorithm involves the following conditional expectation 

Q(G, P\G', P') = E(\og P(D n \G, P)\D° n bs ,G', P') 

N n 

= E E E(Y,l {x! =k^ Pa = j} \D n bs ,G',P')\ogP^ kj , 

i=l jePai,keXi s=l 

where P^kj = P{Xi = k\Pcii = j). Finding Q implements the E-step of the algorithm. The M-step 
maximizes Q(G, P\G' , P') for G and P. Solutions of the EM algorithm are all (G,P) such that 
Q(G, PIG, P) = max G:P Q(G, P\G, P). 

It can be shown that NAL maximization is equivalent to solving a sub-optimal EM algorithm with 
S s =i ^{x s =k,x s p =j} replaced by the sum ni t kj+n™j, where n^jy and n™? are the number of records 
in D n for which the event (JQ = k, Pat = j) is observed and missing, respectively. For each i , this is 
equivalent to replacing £>° bs by a sub-sample D° b f with all x s from £>° bs for which (X,, Pai) is not 
fully observed being removed. Let n"" s = ^2 k . n™j = n—n%- Given Z?° bs , n^jy, n% and n™ s are fixed 
but n™j is random. In fact, n™j£ conditional on (n.j = n™ s , G", P') follows a Binomial distribution. 
Since S(E:=i l{ x t=k, x > Pa =j}\D%!,G',P') = m, kj + E(n™\m = n? is ,G',P'), we define 

N 

Q*(G,P\G',P')=J2 E (n i ,k j +E(n^ s j \n i = nr s ,G',P'))logP i , kj . 

i=l j<zp ai ,k£Xi 

Under the MAR assumption P(X i ,Pa i \Z i ,Z Pai ,G',P') = P(X h Pa^G' , P'), we have E(<n$g) = 
E n7:r (n™]°P> kj ) = n';-r;j';, r Therefore 

N 

Q*(G,P\G,P') = J2 E (r^kj + in-n^P^logP^. (2.7) 
•i=i jePcu.keXi 

We then observe that P >->• Q*(G,P\G,P) is maximized for P^j = n^j/ni = Oi.j and Pi,kj = 
ni,kj/ni t j = <),.;. ,. and consequently 

N 

Q*(G,P\G,P)=nJ2 E hi E Pi,k^ogPi, kj =nl{G\D n ). 

i=l j£P ai keXi 

We hence conclude that the EM algorithm based on Q* essentially maximizes the NAL function 
(2.6). Of course, Q utilizes all of the available data, while Q* does not - when even one component 
of (Xi,Pdi) is missing, Q* treats the entire record as missing, while Q tries to use the available 
information by calculating (often costly) conditional expectations. Nevertheless, the NAL-bascd 
inference is much more efficient than the naive approach that ignores all records for which at least 
one component of X is missing; even more so in cases when the dimensionality N is much higher 
that the maximum size of IPaq's (the so-called in-degree). In such cases the difference between 
Q and Q* is less pronounced (if rii >> n — n.i then \Q — Q*\ << \Q*\) and so is the difference 
between NAL maximization and EM algorithm. Moreover, the sub-optimality of NAL maximization 
is counterbalanced by its computational simplicity. The EM algorithm is usually intractable for 
data with number of nodes in the thousands while NAL optimization may still be a possibility. In 
conclusion, the NAL-based learning seems to be an effective and computationally more affordable 
alternative of EM for estimating high dimensional, low in-degree Bayesian networks. 
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3. MLE and model selection 

Let (Gq,Pq) be a DBN with nodes X, parent sets -Pa°, and observed conditional probability table 
Qq. For an arbitrary DAG G with nodes X and parents Pcii, we consider probability distribution 
Pg\g on X induced by Go which, for a state x of X, is given by 

N 

P GlGo (x) = Y[P (Xi = Xi\P ai = x Pai ) (3.1) 

i=l 

and compare it to 

N 

Po{x) =Y[Po(X t = Xi\P<# = X Pa o). 
t=l 

In general, Pg\g ^ s different from P$ and (G, Pg\g ) ma y n °t De wei l defined DBN, because G is not 
necessarily a minimal DAG compatible with Pg\g ( see condition (2) from the definition of DBN). 
However, if G is a minimal DAG such that Pg\g = Po, then G = Go- 

We also consider the following observation probabilities of G induced by Go 

9 i (G\G )=P(Z i = l,Z Pai =1) 

e itj {G\G Q ) = P(P ai = j\Zi = 1, Z Pai = 1) 

O ilkj {G\G ) = P(X % = k\P ai = j, Zi = 1, Z Pai = 1) 

where the probabilities are with respect to the joint distribution of Z and X|(Go,Po)- Recall that 
according to (2.2) the entries of 0$ are 

Q\ = P{Z i = \,Z P< = \) 

6i 3 =P(Pa°=j\Z l = l,Z Pa « = l) 

9% hj = P{X t = fc|Pa° = 3 , Z t = 1, Z Pa o = 1). 

Let #(G|Go) denote the corresponding conditional probability table with entries 6>;(G|Go), 6>ij(G|Go) 
and tfij^GIGo). Clearly, we can write 6>(Go|Go) = . Moreover, in the important case when Z is 
MCAR we have 

MG|G ) m = r Po(P<>i=j) 

Oi, kj (G\G ) "= r P (A 4 = k\P ai = j) 

and 6*(G|Go) is the conditional probability table corresponding to Pg\g - 
Next, we define the NAL of G with respect to Go given by 

N 

l(G\G ) = Y,l(X t \Pa. l7 Go), 

l{Xi\Pa h G )= ]T 9 itj kkjlog9 i>kj , (3.2) 

where 6 i: j = 9ij(G\Go) and 9i t kj = #i,fcj(G|Go). Essentially, l(Xi\Pa,i, Go) is the observed population 
negative entropy of A, conditional on Pa, and Z(G|Go) is the population version of (2.6). For brevity, 
we shall write l(Go) instead of Z(Go|Go). 
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3. 1 . Identiflability 

Let Go belong to a collection Q of DAGs with nodes X. If D n is an independent sample from a 
DBN (Go,Pq), by the strong law of large numbers, for any fixed G GQ, 6^kj{G\D n ) — > (G|Go), 
a.s., and hence, l(G\D n ) — > 1(G\Gq), a.s. as n — > oo. A necessary condition for MLE consistency 
is the identiflability of Go, which in its usual sense requires '(G|Go) < '(Go) for all G <G Q such 
that G ^ Go- The latter is a strong requirement however, for thus defined the identiflability will 
never hold unless Go is a maximal DAG in Q that contains Go - as we show later (Lemma 7.1) 
l(Xi\Pdi, Go) is a non-decreasing function of Pa^. In the light of this observation we shall adopt a 
more appropriate definition of idcntifiability, one that assumes smaller likelihoods only for the DAGs 
not containing the true one. To simplify the notation, hereafter we shall refer to the DBN (Go,Po) 
simply as Go- 

Definition 3.1. We say that Go is identifiable in Q, if for any G £ Q we have '(G|Go) < '(Go) 
when G C G and l(G\G ) < l(G ) when G £ G. 

Note that the identiflability of Go depends on the joint distribution of X and Z. The utility of 
this definition is due to the following observation. If Go is identifiable in Q, then 

G* = min{G G Q \ l(G\G Q ) = max'(G|G )} = G , (3.3) 

implicitly assuming the existence of unique such minimum G* (in general we may have multiple 
minimal G maximizing the NAL). Moreover, it is easy to check that (3.3) is a necessary and sufficient 
condition for identiflability In 'learning from data' settings, we can replace Z(G|Go) in (3.3) with 
l(G\D n ) and find an estimator G* of the minimal DAG G* , exhaustively in Q or by some more 
efficient algorithm. Then G* would be an estimator of Go as well. In this way, the identiflability 
assures the principal possibility of recovering Go- 

It is intuitively clear that in order to recover the graph structure Go from incomplete samples, the 
missing data mechanism should not interfere with the associations between X^s determined by Go- 
This condition is satisfied for any MCAR model. In more general MAR settings, the identiflability 
of Go depends on the interaction between X and Z and can not be judged without actually knowing 
Go- We thus regard the MAR assumption as not significant generalization over MCAR due to the 
practical impossibility to check it prior to learning. 

The next result shows that in MCAR settings the population NAL does not increase when the 
true DBN is nested in a larger one, and moreover, that its maximum is achieved only for DAGs 
equivalent to the true one. 

Proposition 3.1. If Z is MCAR, we have the following: 
(i) if G C G then l(G\G ) = l(G ); 

(ii) maxgZ(G|Go) = '(Go), where the maximum is over all DAGs on X; 
(ni) ifl(G\G ) = '(Go), then P G \ Gq = P . 

From these properties of the NAL of G with respect to Go we can draw two immediate conclusions 
as stated in the next two corollaries. 

Corollary 3.1. IfL is MCAR then Go is identifiable in any set of DAGs compatible with its order. 

Therefore, provided a true node order is known (that is an order with which Go is compatible; 
there might be many such orders), Go can be recovered from the set of all DAGs compatible with 
that order. 

We can further extend Definition 3.1 to account for classes of equivalent DBNs. Recall that, 
ultimately, it is the independence relation set Z(Go), shared among all equivalent to Go DBNs, that 
is of main interest. In the view of condition (3.3), we say that [Go] is identifiable in Q if 



min{G e G | l(G\G ) = max/(G|G )} = G , 



(3.4) 
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in the sense that any minimal G that maximizes the NAL 1(G\Gq) is equivalent to Go (we also 
assume that the set on the left is not empty). Proposition 3.1, cases (ii) and (Hi), implies that the 
maximum NAL is 1(Gq) and any G that attains this maximum satisfies Pg\Go = Pq- ^ m addition 
G is minimal, then (G, Pg\g ) ^ s a wc ^ defined DBN which is equivalent to (Go, Pq) and hence (3.4) 
is satisfied. We have thus obtained the following. 

Corollary 3.2. IfZ is MCAR, then [Go] is identifiable in any Q that contains at least one element 
of [Go]. In particular, [Go] is (globally) identifiable in the set of all DAGs on X. 

As defined, the idcntifiability of the equivalence class [Go] depends implicitly on the choice of 
log-likelihood proxy function. Note that [Go] is not guaranteed to be identifiable, even in MCAR 
settings, if in (3.4) we replace the NAL I with the standard log-likelihood I from (2.5). 

3.2. NAL-based scoring functions 

As we have observed earlier, the MLE criteria selects the most complex BN in Q containing Go and 
unless some complexity penalization is imposed, the MLE is prone to overfitting. Methodologically, 
there are two approaches addressing the model selection problem. The first one is provided by the 
Bayesian paradigm, where the parameter (G, 9) is assumed coming from some prior distribution and 
one looks for the maximum posterior estimator. The second, frcqucntist, approach is to optimize a 
scoring function based on the log-likelihood and additional complexity penalization term - a penalized 
log-likelihood. We consider a general scoring function of the form 

S(G\D n ) = l(G\D n ) - \ n h(G), (3.5) 

where A„ are positive numbers indexed by the sample size n and h(G) is a positive function accounting 
for the complexity of the G. When needed, we shall write Sh to specify what h is meant. The role 
of the sequence A„ is to apply a proper amount of penalty that guarantees estimation consistency. 

One can employ different measures for network complexity. Any complexity function h is assumed 
to be increasing in the following sense: for any two DAGs G\ and G2 such that G\ C G2, G\ ^ G2, 
we have h(G\) < h(G2)- In regard to DBNs, a typical choice is the total number of parameters df(G) 
needed to specify the multinomial conditional distributions of G, that is, the number of independent 
parameters in 8. 

We return to (3.5) with some typical examples. Since the NAL l(G\D n ), being sum of node 
sample averages, is normalized by the sample size, the standard model selection criteria AIC and 
BIC, formulated in terms of the scoring function (3.5) are given by A„ = 1/n and A„ = 0.51og(n)/n, 
respectively. The so called minimum description length (MDL) score, representing the information 
content of a model, is given by \og(n)df(G)/n and is equivalent to BIC. 

Similarly to NAL, often, the chosen overall DBN complexity can also be represented as a sum of 
node-wise complexities. For example, df(G) = J2 i df(X i \Pa i ), df(Xi\Pai) = (q(X±) — l)q(Pa,i). In 
such cases it might be more appropriate to replace A„ with node-specific penalization A n< 's 

N 

S(G\D n ) =J2iKXi\Pai,Dn) - \ n M x i\ p ai)}- (3-6) 

i=l 

We shall refer to these as decomposable scores. Typically, one uses one and the same function of n 
to express A ni 's, such as A„ = Ao«~ Q , a e (0, 0.5). The decomposable BIC criteria then is 

S B ic(X l \Pa t ,D nil ) = l(Xi\Pa h D n ) - 0.5 l ^^-df(X i \Pa i ) 

N N 

S BIC (G\D n ) =Y J S B ic(X i \Pa i ,D n ) = ^ i —BIC[X i \Pa i ,D nti ) (3.7) 
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where P> n ,i is the sub-sample of D n of size n, for which [Xi,Pai) is observed and BIC(Xi\Pai, D n ,i) 
is the original BIC criteria, (1.1), applied to the regression model Xi\Pa,i. 

As we have stated in the introduction, we consider an MLE based model selection by maximizing 
5* as a function of G given a sample D n , 

G = arg max S(G\D n ). (3.8) 

Note that we do not maximize S for G and 9 simultaneously. We estimate 9 for each G using the 
plug-in estimator 9(G\D n ) and then the DAG with maximal score is chosen as graph structure 
estimator. In what follows we show that, by solving (3.8) for proper A n , we can obtain consistent 
estimation of the true model with no further conditions on h. 

Let G be the estimator (3.8) for a sample D n coming from a DBN Go- Then the following claim 
is immediate. 

Proposition 3.2 (Consistency Criteria). Provided for any G\ € Q and G2 € Q the following two 
conditions are satisfied 

(CI) if G C Gi but G £ G 2 , then P(S(G 1 \D n ) > S(G 2 \D n )) 1, as rw 00, 

(C2) if Go C Gi, G C G 2 and h{G x ) < h{G 2 ), then P(S(Gi|L>„) > S{G 2 \D n )) 1, as n -+ 00, 

G is a consistent estimator of Go, that is, P(G 7^ Go) — > 0, as n — > 00. 

The conditions (CI) and (C2) are relaxed versions of those used in [4]. In fact, the consistent 
scoring criterion in [4] is a special case of the more abstract formulation of model selection consistency 
in [7]. We end this section with the following important observation. 

Corollary 3.3. // conditions (CI) and (C2) are satisfied for any DAG equivalent to Go, then [G] 
is a consistent estimator of [Go], that is, P(T(G) ^ I(Gq)) — > 0, as n — > 00. 

4. Estimation consistency 

Let (Go,Po) be a DBN with conditional table 6*o in a set of DAGs Q and D n be an independent 
sample drawn from it. In this section we investigate the consistency of the estimators G and [G] with 
respect to a scoring function S, where G is given by (3.8). 

As we have observed earlier, if the data has missing values, it is not anymore true that l{G\D n ) = 
l(G\D n ), the usual sample average log-likelihood (2.5). Therefore, (G,9) is no longer an MLE for 
(Go, $o) and the standard consistency results from the asymptotic theory are not directly applicable. 
A proper account for the incompleteness of the data is thus needed. 

For a sample of fixed size n, the random variables ni and the random vectors {ni,j}jePa.i and 
{niM]}k£Xi satisfy 

{ n i,j}j\ n i ~ Multinom({6i t j(G\Go)}j,ni) 

{»../., h- ».., ~ MulUnom({0 iik j(G\Go)}k,ni,j) ( 4 -l) 

and the statistics 9ij and 6i t kj are unbiased estimators of 9i.j(G\Go) and 9i t kj (G|Go), respectively. 
Moreover, if Go is identifiable in Q, then for each i, the probability of the event '(Xi ,Pa®) is observed' 
must be strictly positive, i.e. 9® > 0. Since Q is always finite, the following is well defined 

N 

= minmin{^(G|G )|^(G|Go) > 0} (4.2) 
GeS i=l 

and (3(G) > 0. The complete data case can be thus represented as (3(G) = 1. Note that (3 depends 
implicitly on the distribution of Z. 

The next result establishes the rate of convergence of the empirical NAL to the population one 
without imposing any restrictions on the distribution of Z or on Go (Go need not be identifiable). 
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Lemma 4.1. Let D n be sample from a DBN (Go,Pq)- Then for any DAG G 

l(G\D n )-l(G\G ) = O p (n- 1 / 2 ), (4.3) 

which implies l(G\D n ) — > p l(G\Go). 

Providing conditions for scoring function consistency is our next goal. Let us assume that Go is 
identifiable in Q. In the light of Lemma 4.1, if G does not contain Go, then there is a positive constant 
5 such that /(Go|£>n) — l(G\D n ) > S with probability going to 1, as n —> oo. It is evident therefore 
that if the sequence A rl diminishes with n, X„ —> 0, then, asymptotically, the scoring function Sh will 
select an estimator that contains the true model Go regardless of the chosen complexity function 
h. In addition however, we want that estimator to get close (in sense of the complexity measured 
by h) to Go with the increase of the sample size. Since for any G such that Go C G we have 
l(G\D n ) — 1(Gq\D„) — > p 0, the latter can be assured if we require A„ to diminish at a slower rate 
than that of l(G\D n ) — ^(Go|-D n ). We show that this rate is n" 1 for complete samples and n -1 / 2 in 
case of missing data. 

We moreover show that the consistency sufficient conditions, A n = o(l) and n~ 1 / 2 X~ 1 = o(l), 
become essentially necessary. More precisely, the necessity is guaranteed if the following condition is 
satisfied. As usual Pai and Poff denote the parent sets of G and Go, respectively. 

Condition 4.1. There are G £ Q with Go C G and i £ {1, N} such that Paj = Pa° for all j ^ i, 
Pa^Pa® ^ and P(Z Pa .\ Pa o =l\Zi = 1, Z Pa o = 1) £ (0, 1). 

In words, the condition refers to the possibility of extending the parent set of a node of Go by 
one or more new nodes that are, conditionally, neither always observed nor never observed (thus Go 
must not be a maximal DAG in Q). 

Next, we summarize the above observations in the following theorem. 

Theorem 4.1. Let Go be identifiable in Q and S be a scoring function (3.5) with penalization 
parameter A„ such that X n — > 0. The following are satisfied. 

(i) If (3(Q) £ (0, 1) and \fnX n — > oo, then G is consistent estimator of Gq. 
(ii) If (3(Q) = 1 and n\ n — ¥ oo, then G is consistent estimator o/Go- 

(Hi) If 7i is MCAR, Condition 4-1 holds and lim ^/nAn, < oo, then G is inconsistent estimator of 
Go- 

The complete data case of the theorem, (ii), also follows from a more general result by [7] (Propo- 
sition 1.2 and Remark 1.2). There, the consistency result is derived using the properties of MLE for 
exponential families and central limit theorem. The essential contribution of the above theorem is 
in the missing data cases (i) and (iii). We emphasize that case (i) holds for a general Q and missing 
data distribution as long as Go is identifiable in Q. In (iii) however, we require for Z to be MCAR 
in order to guarantee that the condition y/n\ n — > oo is necessary for consistent estimation. Below 
we make some further remarks. 

The claims of the theorem are established by verifying conditions (CI) and (C2) from Proposition 
3.2 for Go and hence, for any DAG equivalent to Go- Therefore, it follows from Corollary 3.3 that the 
theorem remains true if we replace Go by [Go] and G by [G] . The theorem thus provides conditions 
for consistent estimation of the equivalence class of Go . 

As evident from the proof of the theorem, the requirement A„ — > is needed for guaranteeing 
the first, (CI), consistency condition in Proposition 3.2, while y/nX n — > oo (nX n — > oo) is required 
for the second one (C2). The AIC selection criterion, A„ = 1/n, is not consistent for it satisfies 
(CI) but fails to satisfy (C2), regardless of j3. It will thus recover the true structure but will tend 
to select networks with higher complexities than the true one. Therefore AIC is prone to overfitting 
and so is any scoring function with nX n = 0(1). At the other end of the consistency spectrum of 
a, lim n sup X n > 0, the estimated networks will tend to have complexities below the true one. Due 
to the missingness, there is an implication regarding the BIC(MDL) criterion, X n = 0.51og(n)/n. 
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Because nlog(n)/n — > oo but v / nlog(n)/n — » 0, BIC is guaranteed to be consistent only in the 
complete data case and it will be, in general, inconsistent in MCAR settings (see the corollaries that 
follow). The numerical results presented in Section 5 confirm this conclusion. 

Theorem 4.1 requires the observation probability f3(Q) to be fixed. If we allow it to depend 
on n, case (ii) of the theorem arises from (i) if we have lim n j3 n (Q) = 1. Then n\ n — > oo is a 
sufficient consistency condition. There is no contradiction with case (iii), since then it must be that 
P(Zxi = 1, Zp ai = 1) = 1 and P(Zxi = 1, Z Pa o = 1) = 1, and hence Condition 4.1 fails. As evident 
from the proof of the theorem, when lim/3„ < 1, (i) and (iii) still hold. We leave undecided the last 
alternative < lim/?„ < lim/3„ = 1. 

Next, we argue that Condition 4.1 arises naturally in MCAR settings. In the probability space 

of all MCAR distributions for Z defined by the Borel sets in {u € [0, l] 2 _1 , Y^k=\ X u k < 1} ( a 
distribution u is defined by assigning each of the 2^ states of Z a probability value in [0,1] such that 
their sum is 1), the subspace of distributions for which P(Z Pa .\ Pa o = l\Zi = l,Z Pa o = 1) = or 1 
has Borel measure zero. We thus have the following consequences of Theorem 4.1 which extend 
Corollary 3.1 and 3.2, and essentially summarize the practical contribution of this investigation. 

Corollary 4.1. Let (Go,Po) be a non-maximal DBN and Q consist of all DAGs compatible with a 
node order of Go- Then, for almost all MCAR distributions, G is consistent estimator of Go if and 
only if X n — > and ^/n\ n — > oo. 

In the last statement we assume that Q comprises all DAGs on X and use the global identifiability 
of [G ]. 

Corollary 4.2. Provided that T(Gq) is non-empty, for almost all MCAR distributions, [G] is con- 
sistent estimator of [Go] if and only if X n — > and \/n\ n — s- oo. 

Note that, the non-emptiness of I(Gq) is required in order for any DAG equivalent to Go to be 
non-maximal and hence, for Condition 4.1 to hold. 

5. Numerical experiments 

With the number of possible DAGs being super-exponential to the number of nodes, the task of 
reconstructing a DBN from data is in general NP-hard. The MLE based problem (3.8) essentially 
requires exhausting all DAGs in Q. For the purpose of numerical illustration in this section we make 
two simplifying the inference assumptions - that the causal order of the nodes of the original DBN 
Go is known, as well as the maximum size of the parent sets of Go, its in-degree. We thus assume 
that the search set Q comprises all DAGs compatible with a true node order. By Corollary 3.1, 
when the missing data model is MCAR, Go is identifiable in Q. In our numerical experiments we 
use exclusively the complexity function df, which recall is given by df(G,9) = dim(Qo), and the 
decomposable scoring function (3.6). Then (3.8) can be solved by an efficient exhaustive search via 
dynamic programming, an approach that is implemented in the catnet package for R, [1]. We are 
aware that more general learning algorithms are available in the literature that can also accommodate 
available case analysis based on NAL. For example, one can implement a search based on local 
optimizations as described in [4] by replacing the usual log-likelihood with NAL. However, our 
goal here is not to compare different learning strategics but to empirically verify the conclusions of 
Theorem 4.1, which hold for all NAL-based estimators (3.8). 

The standard AIC and BIC model selection criteria are compared to scoring functions with A n = 
(1/N)n~ a for different choices of a <G (0,1). The factor 1/N, to some extent arbitrary, makes the 
penalization relatively small for not large n (note that NAL is of rate O(N)). For this choice of A n 
and small n, the estimator G therefore may over-fit the data but, provided the scoring criteria is 
consistent, df(G) should approach the true complexity as n increases. 
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Table 1 

Consistency results for a simulated 2-node network. Two possible models Go and Gi are considered as described in 
the main text. Shown are the percents of wrong selections (choosing the alternative Gi instead of the true model Go). 



a 






0.2 


0.3 


0.4 


0.5 


0.6 


0.7 


0.8 


BIC 


AIC 














n = 


10^ 













= 


l 


0.0 


0.0 


0.0 


0.3 


0.9 


3.5 


10.6 


2.8 


16.0 





= 


0.99 


0.0 


0.0 


0.0 


0.5 


1.7 


6.6 


17.0 


4.5 


22.9 


8 


= 


0.95 


0.0 


0.0 


0.2 


0.9 


3.8 


12.8 


24.0 


8.7 


31.2 







0.90 


0.0 


0.0 


0.0 


0.7 


6.9 


16.6 


31.5 


12.5 


37.0 





— 


0.75 


0.0 


0.0 


1.1 


7.0 


18.5 


29.9 


40.2 


27.3 


44.4 














n = 


10 d 













= 


1 


0.0 


0.0 


0.0 


0.0 


0.0 


0.2 


3.6 


0.7 


13.9 





= 


0.99 


0.0 


0.0 


0.0 


0.0 


0.0 


1.6 


13.3 


2.9 


33.5 





= 


0.95 


0.0 


0.0 


0.0 


0.0 


0.4 


12.1 


28.8 


17.1 


42.0 







0.90 


0.0 


0.0 


0.0 


0.1 


3.6 


19.1 


34.7 


23.0 


43.9 







0.75 


0.0 


0.0 


0.0 


1.9 


15.8 


33.2 


42.4 


36.2 


47.2 














n = 


10 4 















1 


0.0 


0.0 


0.0 


0.0 


0.0 


0.0 


0.8 


0.2 


15.0 







0.99 


0.0 


0.0 


0.0 


0.0 


0.0 


2.7 


24.7 


13.9 


44.1 







0.95 


0.0 


0.0 


0.0 


0.0 


1.5 


21.3 


37.7 


31.6 


47.8 







0.90 


0.0 


0.0 


0.0 


0.0 


7.0 


28.9 


41.5 


36.5 


47.5 







0.75 


0.0 


0.0 


0.0 


1.8 


22.3 


41.2 


50.5 


47.3 


53.8 














n = 


10 b 















1 


0.0 


0.0 


0.0 


0.0 


0.0 


0.0 


0.0 


0.0 


17.7 







0.99 


0.0 


0.0 


0.0 


0.0 


0.0 


11.8 


37.8 


35.8 


50.4 







0.95 


0.0 


0.0 


0.0 


0.0 


3.5 


31.3 


44.6 


43.3 


49.8 







0.90 


0.0 


0.0 


0.0 


0.0 


13.1 


36.1 


47.2 


46.0 


50.5 







0.75 


0.0 


0.0 


0.0 


1.0 


21.4 


38.5 


45.5 


45.3 


48.2 



5.1. Simulated 2-node network 

Here we consider a simplest possible example to verify the consistency of the NAL estimator (3.8). 
We generate samples from a model Go with 2 independent binary variables Xi and X2 (that is 
Pai = Pa 2 = 0) with marginal probabilities 6\ = (0.4,0.6) and 82 = (0.3,0.7), respectively. We 
assume that X\ ~i X2 and then the only alternative to Go BN model is G\ with Pai = and 
Pa2 = {^i}- We also assume that X2 is always observed {P{Z2 = 1) = 1) but Z\ is MCAR 
with different missing probabilities P(Z\ = 0) € {0,0.01,0.05,0.10,0.25}. The sample observation 
probability is then (3 = 1- P(Z 1 = 0). For each (3 and sample size n e {10 2 , 10 3 , 10 4 , 10 5 }, we 
generate 1000 samples D n and count how many times S(G\\D n ) > S(Go\D n ), that is, G = G\ 
and Gi is erroneously selected instead of Go- Table 1 summarizes results for different choices of 
the penalization parameter a as well as BIC and AIC. As expected, all considered scoring functions 
except AIC are consistent in the no missing data case ((3 = 1). In presence of missing values however, 
for scoring functions with a > 0.5 the percent of false model selections is significant; moreover, it 
increases when the proportion of missing values increases, suggesting inconsistency. In particular, the 
inconsistency of BIC is very pronounced for all (3 < 1 . Even when the proportion of missing values 
is only 1 percent, j3 = 0.99, the percent of wrong selections start from 4.5 for n = 10 2 and climbs to 
35.8 for n = 10 5 . The presented results are in strong support of the predictions of Theorem 4.1. 

5.2. Consistent estimation of the ALARM network 

Here we consider a well known in the literature benchmark network. ALARM, a medical diagnostic 
alarm message system for patient monitoring developed by [2] , is a typical example of belief propa- 
gation network as those employed in many expert systems. The DAG of ALARM has 37 nodes, 45 
directed edges, varying number of categories (2,3 and 4) and complexity of 473. We perform network 
reconstruction using both complete and MCAR missing data simulated from the network, in order 
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Fig 1. Estimating the ALARM network from complete and MCAR samples of size 2.5e5. For f3 = 1, 0.92, 0.84, 0.70, 
the so called complexity profile - the complexity of the estimated network on y-axis as a function of the penalization 
parameter a on x-axis - is shown in the range [0.25,0.5]. In presence of missing values (f3 < 1), the BIC selection 
(dash, horizontal) tends to move up and away from the true complexity of 473 (solid, horizontal), demonstrating the 
inconsistency of BIC. 




to confirm the effect of missingness on the model selection as predicted by Theorem 4.1. The graph 
structures of the estimated networks are compared to the original one by the so-called F-score, the 
harmonic mean of precision (TP /(TP + FP)) and recall (TP /(TP + FN)), where P and N refer 
to the presence and absence of directed edges, f-score of 1 represents perfect reconstruction. 

Missing data samples are simulated by deleting 1, 2 and 4 values from each sample record, com- 
pletely at random (so, there is not even 1 fully complete record in the samples). Since the maximum 
parent size is 3, in the first case, the probability to have no missing 3-node subset (Xi,Pai) is 
(T) /(Y) an d nence > the effective observation probability /3 from (4.2) is about 0.92. When 2 values 
per record are deleted, /3 drops to ( 3 3 5 ) / ( 3 3 7 ) « 0.84; when 4 values are deleted, /3 is about 0.70. The 
target set of models Q includes all DAGs with 37 nodes, maximum of 3 parents per node, compati- 
ble with the true node order. Under these constraints, the number of DAGs in Q is 1133. For each 
possible complexity t, the optimal network G(t) is found and a final selection is made according to 
their scores. 

Table 2 shows comparison results for 9 scoring criteria (a=0. 25, 0.3, 0.35, 0.4, 0.45, 0.5, 0.75, BIC and 
AIC) and 7 samples sizes, from 5e2 to 2.5e5. In the complete data case, the scoring functions with 
a 6 [0.4, 0.5] and BIC reconstruct the true network for all samples with n > 2.5e4. As predicted, in 
the missing data cases the score function for a — 0.5 and BIC become inconsistent due to ovcrfitting. 
This effect is more clearly demonstrated in Figures 1 and 2 that show the complexity profile functions 
for different experimental cases. According to Theorem 4.1, the complexity profiles in the (0,0.5) 
range should converge to the horizontal line of true complexity. In Figure 1 the sample size is kept 
fixed and we see that with the increase of the proportion of missing values (/? decreasing), the profiles 
depart from the line of true complexity. On the other hand, Figure 2 shows profiles of samples with 
fixed proportion of missing values but of increasing size. We observe that, although slow, the profiles 
get closer to the line of true complexity as n increases. It is also evident that the BIC selected 
complexity drifts up and away from the true one with the increase of the sample size, an indication 
for its inconsistency. 
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Table 2 

Model selection results for the ALARM network using complete samples (ft = 1) and samples following MCAR 
models with /3 = 0.84 and f3 = 0.70. Shown are the F-scores between the true network and the estimated ones. 
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Fig 2. Estimating ALARM from MCAR samples with fixed observation probability ft = 0.84. Shown are the complexity 
profiles of 6 samples of increasing size n. For n > le6, the complexity of the BIC estimates are off charts (> 900). 
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6. Conclusion 

We have addressed the problem of discrete Bayesian network estimation from incomplete data by 
maximizing a penalized log-likclihood scoring function. The essential step in our approach is replacing 
the usual log-likclihood with a sum of node-average log-likclihoods, the so-called NAL. We have 
motivated our decision with a more efficient utilization of the available data and have shown the 
connection between NAL optimization and EM algorithm. Although our setup allows the missing 
data distribution to be arbitrary as long as the true DAG structure remains identifiable, the latter 
rarely holds for general MAR models. As we have demonstrated however, in MCAR settings, the 
identifiability of the set of independence relations, which characterizes all networks equivalent to the 
true one, is always guaranteed. We have shown, Theorems 4.1, that in presence of missing values the 
NAL-based estimator (3.8) requires more stringent conditions on the penalization parameter X n to 
achieve consistency than in the complete data case. The discrepancy is due to the fact that in NAL 
each node may utilize different data subset for estimation thus reducing the overall convergence 
rate. Although the theorem guarantees consistency for penalties in a continuous range, choosing 
an optimal penalization parameter that performs well in finite sample settings is an open problem 
deserving further investigation. 

The scope of this article has been limited to discrete BNs for which self-contained proofs of the 
results have been derived. It is straightforward however to apply NAL-based estimation to other 
classes of parametric BNs, such as linear Gaussian networks. Then, as long as for any G, Gq C G, 
l(G\D„) — l(Go\D n ) is O p (n -1 / 2 ), in the missing, and O p (n _1 ), in the complete data case, Theorem 
4.1, with some technical modification of the proofs, seems to remain valid. Formulating idcntihability 
and consistency for available case analysis in more general graphical model settings is thus a subject 
of continuing interest. 

7. Proofs 

The next lemma is instrumental in the proof of Proposition 3.1. It shows that the (population) node 
log-likelihood l(X\A), X £ {Xi}fL l7 A C {Xi}f =l , is an increasing function of A with respect to the 
set inclusion operation. In complete data settings, this result is better know as non-negativity of the 
Kullback-Leibler divergence. 

Lemma 7.1. For any A, B C X and X G X such that Zb is independent of (X,A,B) given 
(Z = l,Z A = 1), we have l(X\A) < l(X\A,B). The inequality is strict if P(X\A,Z = 1,Z A = 1) ^ 
P(X\A,B,Z = 1,Z A = 1). 

Proof. Let 

6 ka = P(X = k\A = a,Z = 1, Z A = 1). 

By assumption 

Okab = P(X = k\A = a,B = b,Z = l,Z A = 1, Z B = 1) 
= P(X = k\A = a, B = b, Z = 1, Z A = 1) 

We can therefore write the expression Oka = ^beB^i^ = = a,Z = 1,Z A = l)9k a b- By the 
convexity of the function tn>( log(f ) we have 

0ka log(0 fco ) < P ( B = b \ A = a,Z=l,Z A = l)9 kab log(0 fco6 ), 

beB 

and the claim follows from 



l(X\A) = ^P(A= a\Z = 1,Z A = 1)^2 9 ka \og{6 ka ) 

a£A keX 
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< ]T P(B = b\A = a,Z=l,Z A = l)P{A = a\Z=l,Z A = l)J2 e kablog(e kab ) 

aEA,bEB keX 

= P(A = a,B = b\Z=l,Z A = l,Z B = l)J29 kab \og(9 kttb ) = l(X\A,B). 

a£A,beB keX 

The last inequality is strict if P{X\A, Z = 1,Z A = 1)^ P(X\A, B, Z = 1, Z A = 1). □ 
Proof of Proposition 3.1. Part (i) 

Let Go C G. The MCAR condition on Z and LMP imply that for every i and i"cX such that 
Y <g Xi and Y n Pa® — 0, the following two conditions hold 

(i) (Y, Zy) is independent of Xi given (Pa®, Zi = 1, Z Pa a = 1). 

(ii) Zy is independent of Pa° given (Zi = 1, Z Pa a = 1). 

For each i, since Pa® C Pa;, Pai\Pa° -< Xj, by (i) we have that (Pai\Pa°, Z Pa .\ Pa o) and Xi are 
independent conditionally on Pa?, and therefore for each j £ Pa® and f £ Pai\Pa°, 

6 iMm (G\G ) = P(Xi = k\Pa z = (jj'),Zi = l,Z Pai = 1) 
= P{Xi = k\Pa° = j, Z, = 1, Z Pa a = 1) = l/; , r 
Moreover, by (ii) applied to Z Pa .\ Pa o and Pa? 

Y 0ijj> = P{Pa1 = j\Zi = 1, 2pa, = 1) 

j'ePa,i\Pa'> 

= P(Pa°=j\Z i = l,Z P< = l) = 9l j , 

which implies 

l(Xi\Pa t ) = Y 2J 0i,jf Y 0iMjj') lo &®iMjf) 

j£Pa° j'£Pai\Pa° keXi 

We thus have l(G\G ) = l(G ). 
Part (ii) and Part (iii) 

By the definition of l(G\Go) in (3.2) and some summation manipulations we obtain 

N 

l(G\G ) = J2 J2 J2 Po(Xi = Xi,Pa i = xp ai )logPo(X i = Xi\Pa i =xp ai ) 

i—l xp ai £Pa,i xi^Xi 

N 

= EE F »( X = x)\ogP (X l = Xi \Pai = x Pai ) = Y Po(x-)logP G | Go (x), 
i=i iex rex 

where x indexes the states of X. Since ^2 x Po(x) = 1, Yl x Pg\G ( x ) = 1 an d the log-function is 
concave, we have 

l(G\G ) = Y P (x) log P g , Go (.t) < ]T p o(x) logPo(x) = Z(G ), 

a; a: 

with equality that is achieved only when P G | Go = Po- 

□ 
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Proof of Corollary 3.1. Let Go G Q be DBN with a node order and Q be a set of DAGs 
compatible with f2. We need to show that 1(G\Gq) < l(Go) for all G G Q for which Go ^ G. 
Note that for any G E Q, G U G is also a DAG compatible with f2 and by Proposition 3.1, 

l(Xi\Pai) < l{Xi\P ai U Pa°) = i(Jr<|Pa?) 

for all i. Moreover, because Go ^ G, there is an i such that Pa^ = (Pa°\F) for F, ^ Y C Pa° . Since 
P(Xi|Pet;, = 1, Zp ai = 1) ^ P(Ai|Pa°, = 1, Z Pa a = 1), because by definition Go is a minimal 
DAG compatible with P , by Lemma 7.1, ;(X,|Pa°\r) < Z(A 4 |Pa°), implying Z(G|G ) < Z(G ). □ 

The next result is used in the proof of Lemma 4.1. 

Lemma 7.2. Let 4> n = (j> + O^n" 1 / 2 ) and 9 n = 6 + O^n" 1 / 2 ) for 9 > 0. T/ien 

n log(0 n ) - 01og(0) - O^n" 1 / 2 ). (7.1) 
Proof. By applying Taylor expansion to the logarithm function, we can write 

4> n \og(9 n ) - 01og(0) - (0„ - 0)log(0) + — (6 n - 6)), 

Tin 

for some r\ n between 9 n and 9. Since — > p 0/0 < oo, the claims follows from the assumptions. 

□ 

Proof of Lemma 4.1. Let G £ Q has parent sets Pa^. For all i = 1, 2V, j € Pa; and k <E Xi, we 
define 

S,i,kj = 9i.j9i,kj log9i ! kj — 9ij9i t kj logfl^fcj, 

where Oij = 9ij(G\Go) and O^kj = 9i } kj(G\Go), and 9ij and <9^jy are the corresponding estimates. 
In this notation we have 

N 

l(Xi\Pai,D n ) - l(Xi\Pai,G ) = Y, 

i=i jePai keXi 

Note that when either 9i = 0, 9ij = or O^kj = holds, then the state (i, kj) will be unobservable 
and = 0. We thus may assume without loss of generality that 9i > 0, 9ij > and O^kj > 
for all i,j and k. Moreover, by Hocffding's inequality, for all k,j, e > 0, P(|#j,jy — Qi,kj\ > e) < 
2 exp(— 2rii,je 2 ) and hence fl^j = + O p (n -1 / 2 ) and §i t kj = 9i,kj + O p (n -1 / 2 ). We can therefore 
apply Lemma 7.2 to 9ij9i t kj and to infer that £n-j = O p (n -1 / 2 ), from which the claim follows. 

□ 

The following two lemmas are essential for the proof of Theorem 4.1. The first one extends Lemma 
7.2. 

Lemma 7.3. Let for m = l,...,k, 9 rn = 9 + O p (?i~ 1/2 ), 6>o > and 9 = J2 m lrn9 m , for j m > 
such that 7, n = 1. Then 

A = ^7 ro m log(0 TO ) -01og(0) = O^rT 1 ). (7.2) 

rn 

Proof. Note that £? m , and 7 m are all considered to be random variables. By applying Taylor 
expansion to the logarithm function, we can write 

9 m \og(6 m ) = 6 m [log(§) + l(9 m -§)- ^(L ~ Of], 

9 2 ? y m 
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for some r\ m between 9 m and 9. Since rj m — > p 9o > 0, by assumption, we have (9 m —9) 2 /r] 2 n = (^(n -1 ). 
After some algebra we obtain 

A = iJfim l0g(<U ~ 6m. l0g(9)} = Jrn9 2 m ~ P] + O p (n^) 

m m 

= sE^mli - °of - - 9 ) 2 } + O p {n- v ). 

m 

Finally, since 1/9 -> p l/9 < oo and (9 - 9 ) 2 = O^n' 1 ), (7.2) follows. □ 

The next lemma presents a central limit result for difference between sample and sub-sample 
averages. It consequently establishes a variability rate of nT 1 ' 2 for such differences. 

Lemma 7.4. Let for each n > 0, x\, he i.i.d. random variables with mean /i„ and variance 

a 2 such that a 2 — > a 2 < oo. Let for some fixed /3 € (0, 1), k n ~ Binom([3, n) and a n be a random 
draw without replacements of k n elements from the set {1, ...,n}. Then for 

1 1 - 



for almost every sequence {a n } n >i, we ftaue 



^„^ d AA(0,^a 2 ). (7.3) 



Proof. First, we rearrange the elements of S„ 



and define {yl}*Zi to be n ~£ n %i for i € a„, and {z^}*^" to be — i#j for j ^ a„. Hence 5„ = 

y» + •■• + »*" + 4 + ••• + <"*"• 

We are going to apply the Lindeberg-Feller CLT (see for example Th. 2.27 in [12]) to the triangular 
sequence 

A key observation is that for each n, since a n is a random draw without replacements, conditionally 
on a n , y n 's and z n 's are independent. Moreover, it is easy to verify that 

E(S n \k n ) = -( - - (« - M/-*) = 0, 

71 /Crj. 



and 

k n n—k n 
Var(,/^S„\k„\ = \ " nVnrfii 4 \ 4- \ " nVWz* ) = 
i=l i=l 

By the law of large numbers, k n /n — > [3 almost surely, and therefore 



k„ n—k n i i i n2 

Far(ViS„|fc„) = ]T nl/«r(|/;) + £ nFar«) = -( l "~ " j + (n - fc„)Vn- 



1-/3 



-0" 



2 



It is left to verify that for any e > 

k n n—k n 

£^(y«) 2l {v^l3/«,l>e} + ™£«) 2l {,/SK|> e } ->0, 

i=l i=l 
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almost surely for {a n } n >i- Indeed, the left-hand side sum equals 

— — — E[(x n ) l { n^ lXn>VKe} \ + —^—E[{x n ) l^^l^e}] 

and, by the assumptions, the above expectations converge to for almost every sequence {k n } n >i, 
and hence, for almost every {a„}„>i. Therefore the Lindeberg-Feller CLT is applicable and (7.3) 
holds. 

□ 

Proof of Theorem 4.1. We shall first outline some key steps in the proof. The following notion 
will be useful: for a given sample D n and two DAGs G\ = {Pa}} and G 2 = {Pa 2 }, we say that the 
NALs l(Gi\D n ) and l(G 2 \D n ) are estimated upon one and the same sample, if for every i, l(Xi\Pa\) 
and l(Xi\Paf) are estimated from one and the same subsample of D n , that is, for every x s E D n , 
(Xi,Pal) and (X i: Pa 2 ) are either both observed or both unobserved (missing) in x s . When (X i: Paj) 
is observed in all x s £ D n , we say that D n is complete with respect to (Xj,Paj). 

The essential problem of achieving consistent estimation is to decide between the true model Gq 
and a more complex model G\ in which Gq is nested. According to Lemma J^.l, the NAL scores 
l(G\\D n ) and l(Go\D n ) both converge to 1(Gq) at a rate of n -1 / 2 and so does their difference - this 
is essentially Lemma 7.2. If therefore the scoring function penalty X n converges to at a slower than 
ra -1 / 2 rate, in the limit, Gq would be preferred to G\ as a less complex model. The latter condition 
is sufficient for both complete and missing data (claim (i) of the theorem). However, it turns out 
that when l(G\\D n ) and l(GQ\D n ) are estimated upon one and the same sample, as in the complete 
data case, their difference converges at a faster rate of n^ 1 - this is essentially due to the result of 
Lemma 7.3. Then we can relax the necessary convergence rate of X n and still achieve consistency 
(claim (ii)). The application of Lemma 7.3 crucially depends on the condition: for every node Xi, if 
a sample is complete with respect to (Xi,Pa®) so it is with respect to (Xi, Pa}). In MCAR settings 
the latter does not hold and the difference l(G\\D n ) — l(GQ\D n ) has a persistent variability of order 
n~ x / 2 , due to a central limit result, Lemma 7.4- Consequently the condition on \ n to diminish at a 
rate slower than n~ x l 2 becomes both sufficient and necessary (claim (Hi)). 

The more formal proof follows. We shall prove consistency by verifying conditions (CI) and (C2) 
in Proposition 3.2. We first assume that G\,G 2 £ Q are such that Go C G\ and Go ^ G2. Then, 
by the identifiability of Go, Definition 3.1, we have /(G2IG0) < /(Gi|Go). Moreover, by Lemma 
4.1, regardless of the observation probability f3(Q) > 0, Z(Gi|D rl ) — > p Z(Gi|Go) and l(G2\D„) — > p 
l(G 2 \Go), and hence, for S = (Z(Gi|G ) - l(G 2 \G ))/2, P{l(G x \D n ) > l(G 2 \D n ) + 5) -> 1, as n -> 00. 
The consistency condition (CI) therefore holds because the sequence A„ diminishes with n. 

As in the proof of Lemma 4.1, without loss of generality we may assume that for all G € Q, 
1 = 1, N, j e Pa t {G) and k e X l} 9 t , 3 {G\G ) > and 9 %k] (G\G Q ) > 0. 
Part (i) 

Let now assume Go C G±, Gq C G 2 and h(G\) < h(G 2 ). To verify the consistency condition (C2) we 
need to find the rate of convergence of the random variable l(G\ \D n ) — l(G 2 \D n ). This rate depends 
on whether the data is complete or not. 

By Lemma 4.1, l(G\D n ) = 1{G\G Q ) + O^n" 1 / 2 ) and since Z(Gi|G ) = l(G 2 \G ), wc have 

Z(Gi|£>„) - l{G 2 \D n ) = O p (n-^ 2 ). 

The latter holds regardless of the observation probability P(Q) > 0. Therefore, the condition ^Jn\ n — > 
00 implies that the positive sequence X n (h(G 2 ) — h(Gi)) will overcome the likelihood difference 
l(G 2 \ D n )-l(Gi\ D n ) with probability approaching 1, that is, P(S(G\\D n ) > S(G 2 \D n )) 1, as n -> 
00. This proves the first part of the theorem. 
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Part (ii) 

In case of complete data, (3(G) = 1, we shall obtain a faster convergence rate of n _1 for the difference 
l(G\D n ) — l(Go\D n ), Go C G, which will prove the second part (ii) of the claim. 

We consider the sample average log-likelihood of the node Xi. Since Go C G, we have that 
Pcii = Pa®UY for some Y C {Xi}fL 1 , YDPa® = 0. Observe that in Go, Xi cannot have descendants 
in Y. Indeed, if there is a directed path Xi to X s £ Y in Go, this path cannot be in G also, for 
otherwise one would have the loop Xi to X s to Xi in G and G would not be a DAG. But if there 
is a path in Go that is not in G, then Go G, a contradiction. Therefore, by LMP, A* and Y are 
independent given Pa". 

In the usual notation, for j £ Pa", m £ Y and k £ Xi, 6i t kjm denotes the estimator of P(Xi = 
fc|Pa° = j, Y = m) and @i j m is the estimator of P(Pa° i = j,Y = to). We start with the expression 

l{Xi\Pai,D n ) - liXilPalDn) 

= E E E 0i,jm^i,kjm^°S0i,kjm) — E E 6i,j@i,kj l°g(#i,fcj) 

= E E ^A-,jy» (7-4) 



where 



By definition 



A-i.kj = E ®i,kjm l0g(4,fcjm) ~ log(^i,fcj)- 

and 6* 



'i,jm ^ ' v~< anu "i : kjm 

,m' ' n i,j'm' 2^ij' n i,j' n i,jm 

By the sample completeness, we have ^ m ni jm = n^j and ^ m rii t kjm = ^i.fcj, implying 



— Vi.kjm — ^ = = Vi.kj- U-O) 



S?',m n i,j'm n i,j 



If we set 7 m = itjm /0 it j, then ^ m 7m = 1 and X)m7m^,fejm = However, the latter are not 

guaranteed in incomplete settings because then we may have J2 m n i,jm < n i.j and(or) J2 m n i,kjm < 

We can now apply Lemma 7.3 with j m , 9i t kjm and 6ikj to infer that Atkj = O p (n _1 ) and 

l(X i \Pa h D n )-l(X i \Pa° i ,D n ) = O p (n- 1 ). (7.6) 

Therefore l(G\D n ) - l(G \D n ) = Op^ 1 ) holds for all G such that G C G. 

If both Gi and G2 contain Go, it follows that l(G2\D n ) — Z(Gi|J5„) = Op^nT 1 ). Therefore, the 
condition n\ n 00 implies that the positive sequence X„(h(G2) — h(G\)) will overcome the like- 
lihood difference l(G2\D„) — l(G\\D n ) with probability approaching 1, which concludes the second 
part (ii) of the theorem. 



Remark 7.1. (7.6) holds even for incomplete samples D n if they satisfy the property: (Xi,Pai) is 
complete in D n whenever (Xi,Pa®) is complete, or equivalently, l(Xi\Pat, D n ) and l(Xi\Pa®, D n ) 
are calculated upon one and the same subsample of D n . Lemma 7.3 is applicable in such cases 
because we still have Y, m n hjrn = TH,j an d Y, m n i,kjm = iH t kj> an d consequently, X) TO 7"» = 1 
and Ylm"fm@i,kjm = Oi.kj ■ Interestingly, if Z is MCAR with P(Zy\Zi = l,Z Pa o = 1) = a £ 
(0,1), then J2m n *:jm/ n hj a and J2 m n i,kjm/n^ k j -> v a, and consequently J2 m "fm ~> P 1 and 
J2m^frn,9i,kjm — @i,kj 0; this however is not enough for the claim in Lemma 7.3 to hold. 
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Part (iii) 

The last part of the theorem claims the necessity of condition (i) in case of incomplete sample that 
also satisfies Condition 4.1. Let i be an unique node index of G G Q for which the condition holds, 
that is, for j 7^ i, Pcij = Pa®, but Pai\Pa° 7^ 0. Without loss of generality we may assume that 
h(G) = h(Go) + 1 and that D n is complete with respect to (Xi,Pa®). We shall show that for X n 
such that lim -y/nAn < 00 

m^n^ooPiKXilPat, D n ) - l(Xi\Pc$,D n ) - K > 0) > 0, (7.7) 

which is equivalent to S to be inconsistent. 

Let D n = {i*}™ =1 be the n-subsample of D n for which (JQ,Pa,) is observed. Then we have 
l{Xj\Pa,i, D n ) = l(X.i\Pdi, D n ). Note that n is random and n ~ Binom(a,n), for a = P(Zp ai = 
l\Zi = 1, Z Pa0 = 1) G (0, 1), by Condition 4.1. 

For every probability tabic 9, we denote 



n 

i(x i |Po?,e, J D n ) = -y;i(x t |Pa?,ff), 

Ti < ' 



n 
t=i 



and 



where 



We have 



and 



1 " 



n 

s=l 



l(x\Pal9)= S ^^^hi 1 *)' 

«(Xi|Pa°,£) n ) = Z(X 4 |Pa°, for 9 n = arg max l{Xi\Pc$, 0, D n ) 

JpQ|Pa°,D„) = l{Xi\Pa%9 n ,D n ), for 0„ = or. 9 max I (X, | Pa°, 0, £>„). 



Next, we show that the convergence rate of the difference l(Xi\Pa®, 9 n , D n ) — l(Xi\Pa°,9o, D n ) 
is The function f(9) = Z(Xi|Pa°, 0, D ra ) has continuous first and second derivatives in a neigh- 
borhood of 9q. Since = (/ has a maximum at 9 n ), the Taylor's expansion of / at 9 = 9 n 
is 

Wo) = /(«») + O.5(0o - § n) T Q^\e*(9o - 9 n ), for 9* n G [0 o ,0»]. 
Moreover, the Hessian at #0 is bounded because 6*o is bounded away from and 9* n — > p 9$. Hence 

d 2 f d 2 f 

Because \\9 — 9 n \\ 2 = O p (n _1 ), we infer 

yfr(l(Xi\Pa1, 9 n ,D n ) - l(Xi\Pal 9 ,D n )) = O p ( n -^ 2 ). 

Similarly we have 

y/Z(l(Xi\Pal 9 n ,D n ) - liX^Pal 9 Q ,D n )) = Op^ 1 / 2 ). (7.8) 

Due to the MCAR assumption, Z Pa .\ Pa a is independent of (Xi,Pa®). This and Condition 4.1 
imply that D n is obtained from D n by random draws without replacements. Therefore, we can apply 
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Lemma 7.4 to the set {l(x s \Pa° i , #o)}"=i of i.i.d. random variables and its subset {i(5'|Pa°, #o)}tt=i- 
We thus infer 

^(l(Xi\Pc$,Oo,D n )-l(Xi\Pa<l,0 Q ,D n )) -> d Af(0, 7 ^ar(Z(A 4 |Pa°, O ))), (7.9) 

where 7 = (1 — a)/a > 0. Also note that Var(l(Xi\Pa°, 6 )) > by the identifiability of Go- 
Moreover, by Remark 7.1, (7.6) applied to D n yields 

^i(l(X i \Pa i ,D n )-l(X i \Pa i ,D n )) = O p {n- 1 ^). (7.10) 

Finally, we consider the difference implicated in (7.7) 

^(l(Xi\P ai , D n ) - liXilPaPi, D„)) - v^A„ 

apply (7.10) 

= yft{i{Xi\p<%, e n , b n ) - i{Xi\p<%, e n , £>„)) - V^K + o p ( n -^ 2 ) 

then use (7.8) 

= y/n(l(Xi\Pa%,6o,Dn) ~ Z(A 4 |Pa° A, A*)) - J7i\ n + O^n" 1 / 2 ) =: T n + O p (n" 1/2 ). 

Since \im y/nX n < 00, there is a subsequence n' such that lim \pn! A n < = Ao < 00 and for which, taking 
into account the convergence (7.9), we have 

T n . M{-\onVar{l(Xi\Pa!l, ))). 

Therefore lim P(T n > > 0) -> 1 - ^(X Q /y/jVar{l{X\Pa^, 6 ))) > 0, where $ is the c.d.f. of the 
standard normal distribution. Hence (7.7) is verified and with this the proof of the theorem. 
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